Gathering Data

Sources:

Twitter-archive-enhanced.csv

Image predictions (downloaded from the server)

Twitter API (using tweet-json.txt)

Read from JSON file

Assessing Data

twitter_archive_df

Quality issues:

Tidiness

image_predictions_df

Quality issues:

Tidiness

tweet_df

Quality issues:

Tidiness

Cleaning Data

Making a copy of each dataframe

Quality issue

Define:

twitter_archive_df timestamp column values end with a +0000 offset. The offset should be removed so that only the date and time remain.

code:
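A minimal sketch of this step, assuming the timestamps are strings ending in " +0000". The sample values are illustrative only.

```python
import pandas as pd

# Illustrative rows standing in for the real archive
twitter_archive_df_clean = pd.DataFrame(
    {"timestamp": ["2017-08-01 16:23:56 +0000", "2017-07-31 00:18:03 +0000"]}
)

# Strip the trailing UTC offset, keeping only the date and time
twitter_archive_df_clean["timestamp"] = twitter_archive_df_clean["timestamp"].str.replace(
    r"\s*\+0000$", "", regex=True
)
```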

Test:

Quality issue

Define:

twitter_archive_df timestamp column has dtype object. It should be converted to a datetime dtype.

code:
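A minimal sketch of the conversion using pd.to_datetime, assuming the offset has already been stripped from the strings. The sample values are illustrative only.

```python
import pandas as pd

# Illustrative rows standing in for the real archive
twitter_archive_df_clean = pd.DataFrame(
    {"timestamp": ["2017-08-01 16:23:56", "2017-07-31 00:18:03"]}
)

# Convert the string column to a datetime dtype
twitter_archive_df_clean["timestamp"] = pd.to_datetime(twitter_archive_df_clean["timestamp"])
```

After the conversion, the .dt accessor (for example .dt.year or .dt.month) becomes available on the column.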

Test

Quality issue

Define:

twitter_archive_df source column shows the full (href ..) link markup. The source name should be extracted from the link.

code
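A minimal sketch of the extraction, assuming each source value is an HTML anchor tag whose visible text is the source name. The sample rows are illustrative only.

```python
import pandas as pd

# Illustrative rows standing in for the real archive
twitter_archive_df_clean = pd.DataFrame({
    "source": [
        '<a href="http://twitter.com/download/iphone" rel="nofollow">Twitter for iPhone</a>',
        '<a href="http://twitter.com" rel="nofollow">Twitter Web Client</a>',
    ]
})

# Capture the text between the opening tag's ">" and the closing "</a>"
twitter_archive_df_clean["source"] = twitter_archive_df_clean["source"].str.extract(
    r">([^<]+)</a>", expand=False
)
```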

Test

Quality issue

Define:

The dog name column has some rows with invalid names, for example (an, None, 0, a, Al, my, this, all, old, infuriating, the). These should be replaced with NaN.

code
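A minimal sketch of the replacement, using only the example names listed above. The real column may contain other invalid values, and the sample rows are illustrative.

```python
import pandas as pd

# Names listed in the issue definition as invalid
invalid_names = ["an", "None", "0", "a", "Al", "my", "this", "all", "old", "infuriating", "the"]

# Illustrative rows standing in for the real archive
twitter_archive_df_clean = pd.DataFrame({"name": ["Charlie", "a", "None", "the", "Lucy"]})

# mask() sets matching entries to NaN and leaves the rest untouched
is_invalid = twitter_archive_df_clean["name"].isin(invalid_names)
twitter_archive_df_clean["name"] = twitter_archive_df_clean["name"].mask(is_invalid)
```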

Test

Quality issue

Define:

Keep original tweets only. Rows that are retweets or replies should be removed.

code
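A minimal sketch of the filter, assuming a row is a reply when in_reply_to_status_id is set and a retweet when retweeted_status_id is set. The sample rows are illustrative only.

```python
import numpy as np
import pandas as pd

# Illustrative rows: one original tweet, one reply, one retweet
twitter_archive_df_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "in_reply_to_status_id": [np.nan, 10.0, np.nan],
    "retweeted_status_id": [np.nan, np.nan, 20.0],
})

# An original tweet has neither a reply id nor a retweet id
is_original = (
    twitter_archive_df_clean["in_reply_to_status_id"].isna()
    & twitter_archive_df_clean["retweeted_status_id"].isna()
)
twitter_archive_df_clean = twitter_archive_df_clean[is_original].copy()
```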

Test

Quality issue

Define:

Remove the columns (in_reply_to_status_id, in_reply_to_user_id, retweeted_status_id, retweeted_status_user_id, retweeted_status_timestamp) related to replies and retweets, because only the original tweets are kept.

code
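A minimal sketch of dropping the listed columns with DataFrame.drop. The sample frame is illustrative and holds a single original tweet.

```python
import numpy as np
import pandas as pd

reply_retweet_cols = [
    "in_reply_to_status_id",
    "in_reply_to_user_id",
    "retweeted_status_id",
    "retweeted_status_user_id",
    "retweeted_status_timestamp",
]

# Illustrative frame: one original tweet, so the reply/retweet columns are empty
twitter_archive_df_clean = pd.DataFrame({"tweet_id": [1], "text": ["sample tweet"]})
for col in reply_retweet_cols:
    twitter_archive_df_clean[col] = np.nan

# Drop the columns that only describe replies and retweets
twitter_archive_df_clean = twitter_archive_df_clean.drop(columns=reply_retweet_cols)
```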

Test

Tidiness

Define

In twitter_archive_df, some columns contain the string None. It should be replaced with NaN.

code
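A minimal sketch of the replacement, assuming the affected columns are the four dog-stage columns and that missing values are stored as the literal string "None". The sample rows are illustrative only.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Illustrative rows standing in for the real archive
twitter_archive_df_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "None", "None"],
    "floofer": ["None", "floofer", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "puppo", "None"],
})

# Replace the literal string "None" with NaN in the stage columns
stages = twitter_archive_df_clean[stage_cols]
twitter_archive_df_clean[stage_cols] = stages.mask(stages == "None")
```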

Test:

Quality issue

Define

Some dogs appear in more than one stage, e.g. (doggo and floofer). These rows will be removed.

Code
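A minimal sketch of the row filter. It counts how many stage columns are filled per row and treats both NaN and the string "None" as empty, so it does not depend on the order of the cleaning steps. The sample rows are illustrative only.

```python
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Illustrative rows: the second one carries two stages
twitter_archive_df_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", "doggo", "None"],
    "floofer": ["None", "floofer", "None"],
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None", "None", "None"],
})

# Count the filled stage columns in each row
stages = twitter_archive_df_clean[stage_cols]
stage_count = (stages.notna() & stages.ne("None")).sum(axis=1)

# Keep rows with at most one stage
twitter_archive_df_clean = twitter_archive_df_clean[stage_count <= 1].copy()
```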

Test

Tidiness issue

define

doggo, floofer, pupper and puppo should be represented in one column as dog_stage

code
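A minimal sketch of collapsing the four columns into dog_stage, assuming each row has at most one stage set. The helper first_stage is a hypothetical name and the sample rows are illustrative.

```python
import numpy as np
import pandas as pd

stage_cols = ["doggo", "floofer", "pupper", "puppo"]

# Illustrative rows: the third one has no stage at all
twitter_archive_df_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "doggo": ["doggo", np.nan, np.nan],
    "floofer": [np.nan, np.nan, np.nan],
    "pupper": [np.nan, "pupper", np.nan],
    "puppo": [np.nan, np.nan, np.nan],
})

def first_stage(row):
    """Return the first non-missing stage in the row, or NaN if there is none."""
    found = row.dropna()
    return found.iloc[0] if len(found) else np.nan

# Treat the string "None" as missing too, then pick the single stage per row
stages = twitter_archive_df_clean[stage_cols]
stages = stages.where(stages.ne("None"))
twitter_archive_df_clean["dog_stage"] = stages.apply(first_stage, axis=1)

# The four original columns are now redundant
twitter_archive_df_clean = twitter_archive_df_clean.drop(columns=stage_cols)
```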

Test

Quality issue

Define

image_predictions_df columns (p1, p2, p3): names have inconsistent capitalization of the first letter, and underscores (_) need to be replaced with spaces.

code
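A minimal sketch of the normalization. Title case is an assumed target format, since the definition only says the capitalization is inconsistent. The frame name image_predictions_df_clean and the sample rows are illustrative.

```python
import pandas as pd

# Illustrative rows standing in for the real predictions
image_predictions_df_clean = pd.DataFrame({
    "p1": ["golden_retriever", "Labrador_retriever"],
    "p2": ["Chesapeake_Bay_retriever", "pug"],
    "p3": ["kuvasz", "bull_mastiff"],
})

# Swap underscores for spaces and apply one capitalization style
for col in ["p1", "p2", "p3"]:
    image_predictions_df_clean[col] = (
        image_predictions_df_clean[col].str.replace("_", " ", regex=False).str.title()
    )
```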

Test

Tidiness issue

Define:

image_predictions_df: Convert the prediction results into a single column, breed_dog. This relies on (p1_dog, p2_dog, p3_dog): when one of them is True, the breed name is taken from the matching prediction column (p1, p2, p3). breed_dog takes the name from the first True prediction.

code
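A minimal sketch of the selection, assuming p1_dog, p2_dog and p3_dog are boolean columns. Rows with no True prediction are left as NaN, which is an assumption about the intended behaviour. The sample rows are illustrative.

```python
import numpy as np
import pandas as pd

# Illustrative rows standing in for the real predictions
image_predictions_df_clean = pd.DataFrame({
    "p1": ["Pug", "Seat Belt", "Paper Towel", "Web Site"],
    "p1_dog": [True, False, False, False],
    "p2": ["Bull Mastiff", "Chihuahua", "Bath Towel", "Desk"],
    "p2_dog": [True, True, False, False],
    "p3": ["French Bulldog", "Toy Terrier", "Beagle", "Monitor"],
    "p3_dog": [True, True, True, False],
})

# Start with all-missing, then apply p3, p2, p1 in turn so that p1 wins when True
breed = pd.Series(np.nan, index=image_predictions_df_clean.index, dtype=object)
for name_col, flag_col in [("p3", "p3_dog"), ("p2", "p2_dog"), ("p1", "p1_dog")]:
    breed = breed.mask(image_predictions_df_clean[flag_col], image_predictions_df_clean[name_col])

image_predictions_df_clean["breed_dog"] = breed
```

Applying the predictions in reverse order means the highest-ranked True prediction overwrites the lower-ranked ones.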

Test

Tidiness issue

Define

Only the columns (id, retweet_count and favorite_count) should be kept. The other columns will be removed.

code
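A minimal sketch of the column selection. The frame name tweet_df_clean and the extra columns in the sample are illustrative.

```python
import pandas as pd

# Illustrative rows with two extra columns that should be discarded
tweet_df_clean = pd.DataFrame({
    "id": [1, 2],
    "lang": ["en", "en"],
    "retweet_count": [5, 7],
    "favorite_count": [20, 30],
    "truncated": [False, False],
})

# Keep only the three columns of interest
tweet_df_clean = tweet_df_clean[["id", "retweet_count", "favorite_count"]].copy()
```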

Test

Quality issue

Define:

Rename id column to be tweet_id

code
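A minimal sketch of the rename, so that the key column matches tweet_id in the other dataframes. The frame name tweet_df_clean and the sample rows are illustrative.

```python
import pandas as pd

# Illustrative rows standing in for the real tweet data
tweet_df_clean = pd.DataFrame({
    "id": [1, 2],
    "retweet_count": [5, 7],
    "favorite_count": [20, 30],
})

# Rename the key column
tweet_df_clean = tweet_df_clean.rename(columns={"id": "tweet_id"})
```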

Test

Tidiness issue

Define:

Join the tables, with twitter_archive_df_clean as the main dataframe. A main dataframe will be created that contains the 3 dataframes.

code
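A minimal sketch of the join on tweet_id. The inner join is an assumption (it keeps only tweets present in all three frames), and main_df, image_predictions_df_clean, tweet_df_clean and the sample rows are illustrative names and values.

```python
import pandas as pd

# Illustrative frames standing in for the three cleaned dataframes
twitter_archive_df_clean = pd.DataFrame({"tweet_id": [1, 2, 3], "name": ["Charlie", None, "Lucy"]})
image_predictions_df_clean = pd.DataFrame({"tweet_id": [1, 2], "breed_dog": ["Pug", "Chihuahua"]})
tweet_df_clean = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweet_count": [5, 7, 1],
    "favorite_count": [20, 30, 4],
})

# Start from the archive and attach the other two frames on tweet_id
main_df = (
    twitter_archive_df_clean
    .merge(image_predictions_df_clean, on="tweet_id", how="inner")
    .merge(tweet_df_clean, on="tweet_id", how="inner")
)
```

Using how="left" instead would keep every archive row and leave NaN where a tweet has no prediction or count.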

Test

Storing

Quality and Tidiness Summary

Analyzing

Tweets Sources

98% of the tweets came from iPhone.

Dog names

Charlie is the most common dog name.

Favourite count and Retweets

As shown in the above charts and heatmap, there is a strong correlation between favourite count and retweet count.

This shows that retweets and favourites both increase over time. However, there is a big difference between the two counts, and it keeps growing as time passes.
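The relationship between the two counts can also be checked numerically. A minimal sketch using Series.corr (Pearson by default); main_df is a hypothetical name for the merged dataframe and the values below are illustrative, not the project's data.

```python
import pandas as pd

# Illustrative values only, not taken from the real dataset
main_df = pd.DataFrame({
    "favorite_count": [100, 250, 400, 900],
    "retweet_count": [30, 80, 120, 310],
})

# Pearson correlation coefficient between the two counts
r = main_df["favorite_count"].corr(main_df["retweet_count"])
print(f"Pearson r = {r:.3f}")
```

A value close to 1 indicates that tweets with more favourites also tend to have more retweets.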

Dog Stage

Dog stages identified:

Because of the high number of pupper-stage dogs, they account for more of the counts in the dog-breed breakdown.

Dog stage (Retweet and Favorite)